Abstract

Computer vision systems employ a sequence of image understanding algorithms in which the output of an algorithm is the input of the next algorithm in the sequence. These algorithms have varying characteristics, and therefore require different data decomposition and efficient load balancing techniques for parallel implementation. However, since the input data for a task is produced as the output of the previous task, this information can be exploited to perform data decomposition and load balancing. This paper presents several techniques to perform static and dynamic load balancing for vision systems. These techniques are novel because they use knowledge about the data when it is produced. Furthermore, they can be applied to many vision systems because many algorithms in different systems are either the same or have similar computational characteristics. These techniques are evaluated by applying them to a parallel implementation of the algorithms of a motion estimation system on a hypercube multiprocessor system. The motion estimation system consists of 1) extraction of features, 2) stereo match of images in one time instant, 3) time match of images from different time instants, 4) stereo match to compute final unambiguous points, and 5) computation of motion parameters. The performance gains when these techniques are used are significant, and the overhead of using these techniques is minimal.
1. Introduction

Computer vision tasks employ a broad range of algorithms. In a vision system, many algorithms with different characteristics and computational requirements are used in a sequence in which the output of one algorithm becomes the input of the next algorithm in the sequence [1,2]. An example of such a system is a motion estimation system. In such a system, a sequence of images of a scene is used to compute the motion parameters of a moving object in the scene. Figure 1 shows the computational flow for a motion estimation system in which stereo images ($L_{im}$ and $R_{im}$) at each time frame are used as the input to the system. Briefly, the tasks (or algorithms) involved in this system are as follows. The first algorithm is the computation of zero crossings of the images (edge detection, producing $L_{zc}$ and $R_{zc}$). The zero crossings are used as feature points for both stereo and time matching. The stereo match algorithm provides points to compute 3-D information about the object in the scene. Using these matched points ($L_{sm}$ and $R_{sm}$), the corresponding points in the image in the next time frame ($L_{tm}$) are located; this task is performed by the time match algorithm. Again, stereo match is used to obtain the corresponding 3-D points in the next image frame. These two sets of points provide information to compute the motion parameters. The above process is repeated for each new set of input image frames.

This paper presents techniques to perform efficient data decomposition and load balancing for vision systems for medium to large grain parallelism. Two important characteristics of these techniques are that they are general enough to apply to many vision systems, and that they use statistics and knowledge from the execution of a task to perform data decomposition and load balancing for the next task. For example, in the motion estimation system, enough knowledge can be obtained about the output data from the zero crossing step to perform data decomposition and load balancing for the stereo matching step. The advantages of such schemes are as follows. First, these techniques use characteristics of tasks and data, and therefore work well no matter how the data changes. Second, many vision systems consist of such tasks and exhibit the computation flow described above, and therefore these techniques can be used in any such system (e.g., object recognition, optical flow, etc.) [2].

Figure 1 : Computation Flow for Motion Estimation
(Inputs: Lim($t_i$), Rim($t_i$), Lim($t_{i+1}$), Rim($t_{i+1}$). ZC: Convolution and Zero Crossings; SM: Stereo Match; TM: Time Match; MP: Motion Parameter Computation.)

The performance of the proposed techniques is evaluated by using a parallel implementation of the motion estimation system algorithms on a hypercube multiprocessor system. The results show that with uniform partitioning, which does not consider the computations involved, parallel processing does not provide significant performance improvements over sequential processing. Furthermore, by applying the proposed data decomposition and load balancing techniques, significant performance gains (as much as 6 fold) can be obtained over uniform partitioning.

This paper is organized as follows. In Section 2 we provide a brief description of each step in the motion estimation system. For a detailed description, the reader is referred to [3,4]. These algorithms will provide insight into the computations involved in the motion estimation system. Section 3 describes the proposed load balancing and data decomposition techniques. In Section 4 we present a parallel implementation of these algorithms in an integrated environment on a hypercube multiprocessor, and discuss the performance results for each of these algorithms and for the data decomposition and load balancing schemes. Some of these techniques have been applied to other integrated vision systems and have been shown to work well [2,5]. Finally, concluding remarks are presented.

2. Steps in the Motion Estimation System

The motion estimation system consists of the following steps: 1) extraction of features, 2) stereo match of images in one time instant, 3) time match of images from different time instants, 4) stereo match to compute final unambiguous points, and 5) computation of motion parameters. We will not discuss the last process, calculation of motion parameters, but a discussion of how to compute them can be found in [6]. The matching algorithms use stereo image pairs, and the algorithms are designed to find point correspondences between two consecutive time instants, $t_{i-1}$ and $t_i$. From the point correspondences, we can estimate the motion parameters. Typical stereo image pairs at two consecutive time instants ($t_7$ and $t_8$) used in this paper are shown in Figure 2; they are outdoor scenes of a truck at different locations. The images are segmented out from larger images of size 1024x1024. The setup used in taking the images is the parallel axis method. The feature points used in the matching process are edge points, which are considered the more reliable features obtained from an image. In order to save considerable computation time, the matching process is done by employing non-iterative procedures with the assumption of limited displacement (or disparity) between frames. We apply the matching algorithm to two stereo image pairs at two consecutive time instants $t_7$ and $t_8$. The following is a brief description of each major step of the motion estimation system.

2.1. Feature Points 

The feature points used in this algorithm are zero crossing points of an image. We use the method suggested by Huertas and Medioni in [7] to extract the zero crossings of an image. In order to eliminate non-significant zero crossing points and maintain enough detail, we threshold the zero crossing image based on the intensity gradient at each zero crossing point. Figure 3 depicts one of the thresholded zero crossing images (at $t_7$). We associate each zero crossing point with one of the sixteen possible zero crossing patterns as suggested and used by Kim and Aggarwal [8]. The patterns are not used directly; instead, we assign each pattern a value according to its local connectivity. These pattern values are used in the matching process.

2.2. Matching

Once zero crossings are extracted in all the involved images, the matching process is applied to find point correspondences among the images (two stereo image pairs at two consecutive time instants). The evidence used in this process to obtain matched point pairs is the normalized correlation coefficient and the zero crossing pattern values. Furthermore, in order to limit the search space, the assumption of limited displacement or disparity between frames is exploited. The matching process consists of six steps as follows:

1) Perform stereo (from left to right) matching in the $t_{i-1}$ stereo image pair.

2) Obtain unambiguous matched point pairs by eliminating multiple matches.

3) Perform time matching between the unambiguous matched points in the left $t_{i-1}$ image and the feature points of the left $t_i$ image.

4) Obtain unambiguous matched point pairs from the time matched points by eliminating multiple time matches.

5) Perform stereo matching between the unambiguous matched points (obtained in step (4)) in the left $t_i$ image and the feature points of the right $t_i$ image.

6) Obtain unambiguous matched point pairs from the results of the stereo matching by eliminating multiple matches.

The results of the above steps are two sets of unambiguous stereo matched point pairs at time instants $t_{i-1}$ and $t_i$. These two sets are related through steps (3) and (4), the matching over time; therefore, we can pick out all the unambiguous matched points that correspond to each other among the two stereo image pairs at time instants $t_{i-1}$ and $t_i$. The matching algorithm was applied to the images shown in Figure 2. The final results are depicted in Figure 4, which shows that we have enough point correspondences for the motion estimation.
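The stereo matching step can be illustrated with a minimal sketch. It assumes that intensity windows around feature points are compared with the standard normalized correlation coefficient, and that the search is restricted to one image row and to a maximum disparity. The function names, the window half-width `half_win`, and the bound `max_disp` are hypothetical; the zero crossing pattern values and the elimination of multiple matches are not modeled.

```python
import math


def normalized_correlation(a, b):
    """Normalized correlation coefficient of two equal-length intensity windows."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                    sum((y - mean_b) ** 2 for y in b))
    # A flat window has zero variance; treat it as uncorrelated.
    return num / den if den > 0 else 0.0


def match_along_row(left_row, right_row, x, candidates, half_win, max_disp):
    """Best match in right_row for the left feature at column x.

    candidates: feature (zero crossing) columns in the right row.
    Only candidates within max_disp columns of x are examined (limited
    disparity assumption). The caller must ensure x is at least half_win
    columns away from both ends of left_row.
    Returns (column, score), or (None, -1.0) if nothing qualifies.
    """
    window = left_row[x - half_win: x + half_win + 1]
    best, best_score = None, -1.0
    for c in candidates:
        if abs(c - x) > max_disp:
            continue  # outside the allowed disparity range
        if c - half_win < 0 or c + half_win >= len(right_row):
            continue  # window would fall off the row
        score = normalized_correlation(
            window, right_row[c - half_win: c + half_win + 1])
        if score > best_score:
            best, best_score = c, score
    return best, best_score
```

The cost of `match_along_row` grows with the number of candidates that fall inside the disparity range, which is why densely packed features along a row are more expensive to match than the same number of sparse features.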


3. Data Decomposition and Load Balancing Techniques for Parallel Implementation 

In a multiprocessor system the simplest method to implement a task in parallel is to decompose the data equally and uniformly among the processors. In a completely deterministic computation in which the computation is independent of the input data, such schemes perform well, and normally the processing time is comparable on all the processors. That is, efficient utilization and load balancing can be obtained. For example, regular algorithms such as convolution, filtering, or FFT exhibit such properties. The amount of computation to obtain each output point is the same across all input data. Therefore, uniform decomposition of the data results in a load balanced implementation.

Most other algorithms do not exhibit a regular structure, and the computation involved is normally data dependent. Furthermore, the computation is not uniformly distributed across the input domain. In such cases, a simple decomposition of the data does not provide an efficient mapping, and results in poor utilization and low speedups. Also, the performance cannot be predicted for a given number of processors and a given data size, because the computation varies as the type of data and its distribution vary. For example, in the stereo match algorithm, the computation is greater where feature points are dense, and is comparatively small where the number of features is small and they are sparsely distributed (Figure 3).

In a vision system, it is important to efficiently allocate resources and perform load balancing at each step to obtain any significant performance gains overall. An important characteristic of such systems is that the input data of a task is the output of the previous task. Therefore, while computing the output in the previous task, enough knowledge about the data can be obtained to perform efficient scheduling and load balancing.

(b) Left and right images at time instant $t_8$
Figure 2 : Images set of $t_7$ and $t_8$

Alok Choudhary

(a) : At time instant $t_8$
Figure 4 : Unambiguous matched points of Figure 2

Consider a parallel implementation of a task on an n processor parallel machine. Let $T_i$ ($1 \le i \le n$) denote the computation time at processor node i. Then the overall computation time for the task is given by

$$T_{max} = \max\{T_1, \ldots, T_n\} \tag{1}$$

The total wasted time (or idle time) $T_w$ is given by

$$T_w = \sum_{i=1}^{n} (T_{max} - T_i) \tag{2}$$

If $T_{max} = T_i$ for all i, $1 \le i \le n$, then the task will be completely load balanced. Another measure of imbalance is given by the variation ratio V,

$$V = \frac{T_{max}}{T_{min}}, \qquad T_{min} = \min\{T_1, \ldots, T_n\} \tag{3}$$

The goal in performing load balancing is to minimize $T_w$, or to move V as close to 1 as possible. In the best case, $T_w = 0$ or $V = 1$.

If $T_{seq}$ is the time to execute the same task on a sequential machine, then the speedup is given by

$$S_n = \frac{T_{seq}}{T_{max}} \tag{4}$$

Therefore, by minimizing $T_{max}$, the achievable speedup can be maximized. In the following we discuss such schemes, and in the next section we present the performance results for a parallel implementation of the algorithms in the motion estimation system.
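The measures defined in this section are simple to compute from per-processor timings. The following is a minimal sketch; the function name and the example timings are illustrative and do not come from the measurements reported later in the paper.

```python
def load_balance_metrics(times, t_seq):
    """Compute the load balancing measures for one parallel run.

    times: computation time T_i observed at each of the n processors.
    t_seq: time for the same task on a sequential machine.
    """
    t_max = max(times)                       # overall computation time
    t_min = min(times)
    t_w = sum(t_max - t for t in times)      # total wasted (idle) time
    # Variation ratio; infinite if some processor did no work at all.
    v = t_max / t_min if t_min > 0 else float("inf")
    speedup = t_seq / t_max
    return {"t_max": t_max, "t_w": t_w, "v": v, "speedup": speedup}


if __name__ == "__main__":
    # One overloaded processor dominates the run.
    print(load_balance_metrics([4.0, 2.0, 1.0, 1.0], t_seq=8.0))
    # The same total work, perfectly balanced.
    print(load_balance_metrics([2.0, 2.0, 2.0, 2.0], t_seq=8.0))
```

In the first call one processor holds half the work, so the speedup on four processors is only 2; in the balanced case the wasted time is zero and the speedup is 4.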


3.1.1. Uniform Partitioning 

Data decomposition using uniform partitioning performs well as a load balancing strategy for input data independent tasks, because dividing the data equally distributes the computation equally among the processors. If the total input data size is D, then the total computation time to execute a task is $T = k \times D$, where k is determined by the computation at each input data point. For example, in the convolution of an image with an $m \times m$ kernel, $k = 2 \times m^2$ floating point operations. Hence, for an n node multiprocessor, the data decomposition method that balances the computation is to make the granule size

$$d_i = \frac{D}{n} \tag{5}$$

For data independent algorithms, such a partitioning guarantees equal distribution of computation among processors. Therefore, if communication time can be minimized, then optimal performance can be obtained on a multiprocessor.
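A minimal sketch of uniform partitioning along image rows follows. It assumes the granule is a contiguous block of rows and that any remainder rows are spread over the first processors; the function names are hypothetical.

```python
def uniform_partition(num_rows, n):
    """Split rows 0..num_rows-1 into n contiguous blocks of near-equal size.

    Returns half-open (start, end) row ranges; block sizes differ by at
    most one row when num_rows is not a multiple of n.
    """
    base, extra = divmod(num_rows, n)
    bounds = []
    start = 0
    for p in range(n):
        size = base + (1 if p < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def convolution_cost(num_pixels, m):
    """Floating point operations for an m x m convolution: k * D, k = 2 * m^2."""
    return 2 * m * m * num_pixels


if __name__ == "__main__":
    blocks = uniform_partition(256, 8)
    print(blocks[0], blocks[-1])
    # Every block holds the same number of pixels, hence the same cost.
    print(convolution_cost(32 * 256, 20))
```

Because the cost depends only on the number of pixels, equal blocks imply equal work, which is exactly the property that fails for data dependent tasks.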


3.1.2. Static 

When computation is not uniformly distributed across the input domain, and is data dependent, uniform partitioning does not work well for load balancing. Normally, computation depends on the significant data elements in a partition. Many vision algorithms exhibit this behavior. For example, in stereo match, Hough transform, etc., the computation is proportional to the number of features (edges) or significant pixels in a granule rather than to the granule size. Therefore, equal size granules do not guarantee load balanced partitioning because of the data dependent nature of the computation. In many such algorithms, the computation time for a granule (i), $T_i$, is proportional to a certain extent to the granule size (fixed overhead to process a granule), and to the number of significant data in a granule. That is,

$$T_i = A \times d_i + B \times f_i \tag{6}$$

where $d_i$ is the granule size, $f_i$ is a measure of significant data in granule (i), and A and B are arbitrary constants which depend on the algorithm. The objective is to divide the computation among processors such that each processor receives an equal measure of computation. One way to assign a granule to a processor is to compute the total measure of computation and partition it as follows:

$$T_i = \frac{1}{n} \sum_{i=1}^{g} \left( A \times d_i + B \times f_i \right) \tag{7}$$

where g is the total number of granules in the input domain (note that the number of granules for the current task is n for an n processor system).

For example, consider computing the Hough transform of an edge image to detect line segments. If there exists a line whose normal distance from the origin is r, and the normal makes an angle $\theta$ with the x-axis, then if a point (x, y) lies on that line, the following equation is satisfied.

$$r = x \cos\theta + y \sin\theta$$

r and $\theta$ are quantized for the desired accuracy, and then for each significant pixel (where there is an edge), r is computed for all quantized $\theta$ values. If two partitions of equal size contain different numbers of edge pixels, then the amount of computation will be different for the two partitions despite their being equal in size. In fact, the computation is directly proportional to the number of edge pixels in a partition. A way to perform static load balancing is to decompose the input data such that each partition contains an equal number of edge pixels. The computation to recognize this partitioning can be performed in the task in which edges are detected, by keeping a count of the number of edges detected by a processor. Once a task is completed, the data can be reorganized such that the number of edges with each processor is in the interval $\left(\frac{Z_a}{n} - \delta,\ \frac{Z_a}{n} + \delta\right)$, where $Z_a$ is the total number of edges detected in the image, and $\delta$ is determined by the minimum granule size from fixed overhead considerations.
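The static scheme can be sketched as a single pass over per-row edge counts gathered during edge detection. The greedy cutting rule below is one simple way to approach equal edge totals with whole rows as the minimum granule; it is an illustrative assumption, not necessarily the procedure used in the paper, and the function name is hypothetical.

```python
def static_partition(edge_counts, n):
    """Cut rows into contiguous blocks with roughly equal edge totals.

    edge_counts: number of edge pixels in each image row, as counted by
                 the previous (edge detection) task.
    n:           number of processors.
    Returns half-open (start, end) row ranges. A block is closed as soon
    as the running edge count reaches the next multiple of total / n, so
    the balance is only as fine as one row allows. The final range can be
    empty if all edges lie in the earlier rows.
    """
    total = sum(edge_counts)
    bounds = []
    start = 0
    running = 0
    for row, count in enumerate(edge_counts):
        running += count
        closed = len(bounds)
        # Integer comparison of running >= total * (closed + 1) / n.
        if closed < n - 1 and running * n >= total * (closed + 1):
            bounds.append((start, row + 1))
            start = row + 1
    bounds.append((start, len(edge_counts)))  # remaining rows
    return bounds


if __name__ == "__main__":
    counts = [0, 0, 8, 8, 0, 0, 0, 0]
    blocks = static_partition(counts, 2)
    print(blocks, [sum(counts[s:e]) for s, e in blocks])
```

For the example counts, uniform partitioning into rows 0-3 and 4-7 would give one processor all 16 edges, whereas the sketch yields blocks of 3 and 5 rows holding 8 edges each.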


3.1.3. Weighted Static

When the computation in a granule depends not only on the number of significant data points in the input domain but also on their spatial relationships, the data distribution needs to be taken into account as a measure of load to perform load balancing. For example, in stereo match or time match, the computation depends not only on the number of zero crossings but also on their spatial distribution. If the zero crossings are densely spaced, then the computation will be greater than if the same number of zero crossings are sparsely distributed. The reason is that if the zero crossings are densely packed, then more zero crossings need to be matched with each corresponding zero crossing in the other image, whereas fewer zero crossings need to be matched if they are sparsely distributed. Hence, the computation also depends on the spatial density (such as features/row if one dimensional matching is performed). That is,

$$T_i = A \times d_i + B \times w_i \times d_i \tag{8}$$

where $w_i$ is the feature dependent spatial density. For example, if the minimum granule size is a row of the input data, then $w_i = r_i^{\beta}$, where $r_i$ is the number of features in row i, and $\beta$ is a parameter, $0 \le \beta \le 1$. $\beta = 0$ means that the computation is independent of how the features are distributed within a row. Therefore, to divide the computation equally among n processors, the following heuristic can be used.

$$T_i = \frac{1}{n} \sum_{i=0}^{R} \left( A \times d_i + B \times w_i \times d_i \right) \tag{9}$$

where R is the number of rows in the image. Note that the above heuristics approximate the load and do not exactly divide the computation among processors. However, in the next section we will show that these schemes perform well.
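The weighted static heuristic can be sketched in the same style as a row-based partitioner, with the per-row load taken from Equation (8) and the per-processor target from Equation (9). The default constants `a`, `b`, and `beta`, the greedy cutting rule, and the function names are illustrative assumptions.

```python
def row_load(num_features, row_size, a, b, beta):
    """Estimated load of one row: A*d_i + B*w_i*d_i with w_i = r_i ** beta."""
    w = num_features ** beta
    return a * row_size + b * w * row_size


def weighted_static_partition(features_per_row, row_size, n,
                              a=1.0, b=1.0, beta=0.5):
    """Cut rows into contiguous blocks of roughly equal estimated load.

    Returns (bounds, target): half-open (start, end) row ranges and the
    per-processor target load (total estimated load divided by n).
    """
    loads = [row_load(r, row_size, a, b, beta) for r in features_per_row]
    target = sum(loads) / n
    bounds, start, running = [], 0, 0.0
    for row, load in enumerate(loads):
        running += load
        # Close a block once the running load reaches the next target multiple.
        if len(bounds) < n - 1 and running >= target * (len(bounds) + 1):
            bounds.append((start, row + 1))
            start = row + 1
    bounds.append((start, len(loads)))
    return bounds, target


if __name__ == "__main__":
    features = [0, 0, 16, 0, 0, 16, 0, 0]
    print(weighted_static_partition(features, row_size=4, n=2))
```

As the text notes, this only approximates the load: whole rows are the smallest unit, and the constants must be tuned to the algorithm being balanced.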
3.1.4. Dynamic

The above three methods use knowledge about the data when it is produced to perform load balancing for the next task. However, once the decomposition is done, the data is not reshuffled. Therefore, we consider the above methods to be knowledge based static load balancing schemes. In the dynamic scheme, the data is decomposed into finer granules such that the number of tasks (that is, the number of independent granules), M, is much larger than the number of processors.

At execution time the processors are assigned these tasks dynamically by a designated scheduler from a task queue containing these tasks. Processors are assigned new tasks as they finish their previously assigned tasks, if there are more tasks left to be assigned. However, the knowledge obtained from the previous step can be used again to anticipate the completion of a task, in order to assign a new task to a processor. That is, the task assignment can be pipelined, thereby reducing the overhead of dynamic assignment.

The following procedure illustrates the dynamic assignment of tasks onto the processors. The pseudo code essentially illustrates what the scheduler does in order to perform dynamic load balancing. The number of tasks (max_tasks) is determined during the execution of the preceding step in the system, and the task_queue contains all the tasks, including the computational information associated with each task. Initially, the scheduler assigns tasks to each processor. The number of tasks to be assigned initially is a parameter (pipe_line_no). If this parameter is 1, there is no anticipatory scheduling. In other words, a processor is assigned a new task only when it finishes the task it is currently executing. A task is assigned to a processor only if the task contains significant computation. For example, in stereo match, if a task's data does not contain any zero crossings, then the task can be discarded because it is not going to produce any useful information anyway. In a blind scheme, where little is known about a task, the task would be assigned, which is an overhead that can be avoided by using the knowledge obtained from the previous steps. Whenever a processor $P_i$ completes the current task, it sends a compl_msg to the scheduler, which assigns $P_i$ a new task if the task_queue is not empty. Once the task_queue becomes empty, the scheduler sends a term_msg (terminate message) to all the processors. Upon receiving a term_msg from the scheduler, the processors complete the remaining tasks in their task queues and send a term_msg to the scheduler, terminating the computation. Note that by using pipe_line_no, anticipatory dynamic scheduling can be performed, and a processor need not be idle while a new task is being assigned. By using this parameter, the amount of initial static assignment and of dynamic assignment can be controlled.
Dynamic Scheduling of Tasks

/* Initial Assignment */

1.  curr_task = 0;
2.  for j = 1 to j <= pipe_line_no do
3.      for i = 1 to i <= num_proc do
4.          if comp(task_queue(curr_task)) > 0
5.              schedule curr_task at proc. P_i;
6.              curr_task = curr_task+1;
7.          else
8.              curr_task = curr_task+1;
9.              go to 4.
10.         end_if
11.     end_for
12. end_for

/* Scheduling */

13. done = false; k = num_proc;
14. while not done do
15.     wait for msg from a processor;
16.     receive msg;
17.     if ( msg = compl_msg )
18.         P_i = sender processor;
19.         if curr_task < max_tasks
20.             if comp(task_queue(curr_task)) > 0
21.                 schedule curr_task at proc. P_i;
22.                 curr_task = curr_task+1;
23.             else
24.                 curr_task = curr_task+1;
25.                 go to 19.
26.         else
27.             send term_msg to P_i;
28.     else if ( msg = term_msg )
29.         k = k - 1;
30.         if (k <= 0)
31.             done = true.
32. end_while
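The behavior of the scheduler can be explored with a small sequential simulation. The sketch below is an illustration under simplifying assumptions: message and assignment costs are zero, each task's cost is known exactly, tasks with no computation are skipped, and the term_msg handshake is replaced by the event queue running empty. Unlike the pseudo code, the initial assignment also checks that tasks remain. The function name is hypothetical.

```python
import heapq


def simulate_dynamic_schedule(costs, num_proc, pipe_line_no=1):
    """Simulate dynamic task assignment on num_proc processors.

    costs[t] is the computation measure of task t (0 means no significant
    computation, so the task is discarded). Returns (finish_times,
    assigned), where assigned[p] lists the tasks run by processor p.
    """
    curr_task = 0
    max_tasks = len(costs)
    busy_until = [0.0] * num_proc          # time at which each queue drains
    assigned = [[] for _ in range(num_proc)]
    events = []                            # (completion time, proc): compl_msg

    def next_nonempty():
        """Advance past empty tasks; return the next task index or None."""
        nonlocal curr_task
        while curr_task < max_tasks and costs[curr_task] <= 0:
            curr_task += 1
        if curr_task >= max_tasks:
            return None
        task = curr_task
        curr_task += 1
        return task

    def schedule(proc):
        task = next_nonempty()
        if task is None:
            return
        busy_until[proc] += costs[task]    # task runs after those queued
        assigned[proc].append(task)
        heapq.heappush(events, (busy_until[proc], proc))

    # Initial assignment: pipe_line_no tasks per processor.
    for _ in range(pipe_line_no):
        for proc in range(num_proc):
            schedule(proc)

    # Scheduling loop: each completion triggers one new assignment.
    while events:
        _, proc = heapq.heappop(events)
        schedule(proc)
    return busy_until, assigned


if __name__ == "__main__":
    print(simulate_dynamic_schedule([5, 0, 1, 1, 1, 1, 1], num_proc=2))
```

In the example, one processor receives the single expensive task while the other works through the five small ones, and the empty task is never assigned.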


4. Parallel Implementation and Performance Evaluation 

This section presents a parallel implementation of the algorithms that are part of the motion estimation system and describes the performance of the algorithms and load balancing strategies.

4.1. Hypercube Multiprocessor

A hypercube multiprocessor system of size P has P processors, where P is an integral power of 2. The P processors are indexed by the integers 0, ..., P-1, and the following criterion is satisfied. If the processor numbers are represented by $\log_2(P)$ bits, then two processors are connected by a communication link if and only if their bit representations differ by exactly one bit. Therefore, each processor is connected to $\log_2(P)$ processors with direct communication links. The diameter of a hypercube of size P is $\log_2(P)$ (the diameter is the maximum distance between any two nodes). We used an Intel iPSC/2 hypercube multiprocessor consisting of 16 nodes. Each node consists of an Intel 80386 processor, an Intel 80387 co-processor, 4 megabytes of memory, and a communication module.
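The connectivity rule translates directly into bit operations. A minimal sketch follows; the function names are hypothetical.

```python
def hypercube_neighbors(node, num_proc):
    """Direct neighbors of a node in a hypercube of num_proc processors.

    num_proc must be an integral power of 2. Two nodes are linked exactly
    when their binary indices differ in one bit, so each neighbor is
    obtained by flipping one of the log2(num_proc) bits.
    """
    dim = num_proc.bit_length() - 1
    if num_proc < 1 or (1 << dim) != num_proc:
        raise ValueError("num_proc must be a power of 2")
    return [node ^ (1 << bit) for bit in range(dim)]


def hop_distance(a, b):
    """Minimum number of links between nodes a and b (Hamming distance)."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    print(hypercube_neighbors(5, 16))   # node 0101 in a 16-node cube
    print(hop_distance(0, 15))          # opposite corners: the diameter
```

For the 16-node machine, every node has four neighbors and the farthest pair of nodes is four links apart.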


4.2. Feature Extraction

The features used for the stereo match algorithms are the zero crossings of the convolution of the image with the Laplacian. Zero crossing computation involves 2-D convolution and extraction of zero crossings from the convolved image. Since convolution is a data independent algorithm, uniform partitioning is sufficient to distribute the computation evenly. The mapping is a division of an N x N image onto P processors. Each processor computes the zero crossings of its share of $N^2/P$ pixels. Data division onto the processors is done along the rows. This mapping reduces communication to one direction only. The reason is that 2-D convolution can be broken into two 1-D convolutions [7]. This not only reduces the computation from $W^2$ sum of products operations per pixel to $2 \times W$ sum of products operations per pixel (W is the convolution mask window size), but also reduces the communication requirements in a parallel implementation if the data partitioning is done along the rows. There is no need for communication when convolution is performed along the rows.
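The saving from breaking a 2-D convolution into two 1-D passes can be checked on a small example. The sketch below assumes a mask that is the outer product of a column kernel and a row kernel, and it computes sums of products without flipping the kernels (identical to convolution for symmetric kernels). The function names and the test kernels are illustrative, not the masks used in [7].

```python
def convolve_rows(image, kernel):
    """Slide a 1-D kernel along every row (valid positions only)."""
    w = len(kernel)
    return [[sum(row[c + k] * kernel[k] for k in range(w))
             for c in range(len(row) - w + 1)]
            for row in image]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def separable_convolve(image, col_kernel, row_kernel):
    """Two 1-D passes: about 2*W products per output pixel.

    The row pass touches only data within a row, so a row-wise partition
    needs no communication for it; the column pass needs boundary rows.
    """
    tmp = convolve_rows(image, row_kernel)
    return transpose(convolve_rows(transpose(tmp), col_kernel))


def direct_convolve(image, col_kernel, row_kernel):
    """Direct 2-D pass with mask[i][j] = col_kernel[i] * row_kernel[j]:
    W*W products per output pixel."""
    h, w = len(col_kernel), len(row_kernel)
    rows, cols = len(image), len(image[0])
    return [[sum(image[r + i][c + j] * col_kernel[i] * row_kernel[j]
                 for i in range(h) for j in range(w))
             for c in range(cols - w + 1)]
            for r in range(rows - h + 1)]


if __name__ == "__main__":
    image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    print(separable_convolve(image, [1, 2, 1], [1, 0, -1]))
    print(direct_convolve(image, [1, 2, 1], [1, 0, -1]))
```

Both functions produce the same output; for a window of size W = 20 the separable form needs 40 products per pixel instead of 400.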

Table 1 shows the performance results for the above implementation for an image of size 256x256 and a convolution window of size 20x20. The first column shows the number of processors in the cube (P). The second column represents the total processing time ($t_{proc}$) for convolution. Column 3 shows the number of bytes communicated by a processor to the neighboring processor, and column 4 shows the corresponding communication time, which is small compared to the computation time. The second half of the table shows the computation time for extracting zero crossings from the convolved image. The corresponding speedups are also shown.

It can be observed that almost linear speedup is obtained for convolution. Two factors which contribute toward this result are that the communication overhead is relatively small, and that communication is constant as the number of processors increases. However, the speedup obtained in the elapsed time, which also includes the program and data load time, is sub-linear for the following reason. The hypercube multiprocessor's host does not have a broadcast capability, and therefore the overhead of loading the program increases linearly with the number of processors. However, the increase in data load time with the number of processors is comparatively small, because the amount of data to be loaded to one processor decreases as the number of processors increases. The only increment in data load time results from the number of communication setups from the host to the node processors, which increases linearly with the number of processors.

Table 1 : Performance for Feature Extraction (Zero Crossings)

Computation for Convolution and Zero Crossings (convolution window size = 20x20)

No. Proc.   Conv. Comp.    Conv. Comm.   Conv. Comm.   Conv.      ZC Comp.
            Time (sec.)    (bytes)       Time (ms.)    Speedup    Time (sec.)
1           109.0          0             ...           ...        ...
2           54.76          2816          ...           ...        ...
4           27.51          5632          ...           ...        ...
8           13.88          5632          ...           ...        ...
16          7.07           5632          ...           ...        ...

Feature Extraction Performance (Elapsed Time)

No. Proc.   Elapsed Time (sec.)   Speedup
1           116.2                 1
2           58.8                  1.97
4           30.1                  3.86
8           16.1                  7.22
16          9.6                   12.1
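The elapsed-time speedups follow directly from the tabulated times. The short sketch below recomputes them from the elapsed times in Table 1; the efficiency figure (speedup divided by the number of processors) is an added illustrative quantity that the paper does not report, and the function name is hypothetical.

```python
# Elapsed times in seconds, taken from the elapsed-time half of Table 1.
ELAPSED = {1: 116.2, 2: 58.8, 4: 30.1, 8: 16.1, 16: 9.6}


def speedup_and_efficiency(elapsed):
    """Map each processor count P to (speedup, efficiency).

    speedup = elapsed[1] / elapsed[P]; efficiency = speedup / P.
    """
    base = elapsed[1]
    result = {}
    for p, t in sorted(elapsed.items()):
        s = base / t
        result[p] = (s, s / p)
    return result


if __name__ == "__main__":
    for p, (s, e) in speedup_and_efficiency(ELAPSED).items():
        print(f"P={p:2d}  speedup={s:5.2f}  efficiency={e:4.2f}")
```

The recomputed speedups agree with the tabulated ones to within the rounding of the published times, and the falling efficiency at 16 processors reflects the linearly growing program load overhead described above.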


4.3. Matching Features 

This task involves matching features in a stereo pair of images. Since the imaging setup uses the parallel axis method, the epipolar constraint is used to limit the search space for matching to one dimension, which is the horizontal direction. Thus data partitioning along the rows for parallel implementation results in no communication between node processors as long as each partition contains an integral number of rows.

The computation involved in the stereo matching algorithm is data dependent. The computation varies across the image because it depends on the number of zero crossings, the distribution of zero crossings across the image, and the distribution of zero crossings along the epipolar lines. Therefore, partitioning the data uniformly among the processors (i.e., assigning each processor an equal number of rows) may not yield the expected speedups and processor utilization. A processor which has very few, sparsely distributed zero crossings will be underutilized, whereas a processor with a large number of densely distributed zero crossings will become a bottleneck.

We used uniform partitioning, static load balancing, weighted static, and dynamic load balancing schemes to decompose the computation on the multiprocessor. Static load balancing can be achieved by keeping a count of the zero crossings with each processor when the previous task (feature extraction) is executed. At the completion of the task, the data is reorganized using this information and the techniques described in the previous section.

Figure 5 shows the distribution of the computation times for the 8 processor case. The X-axis shows the processor number, and the Y-axis shows the computation time for each scheme. As we can observe, uniform partitioning does not perform well at all because the variation in computation time is tremendous, and therefore performance gains are minimal. The static load balancing scheme (shown as dashed bars) performs much better than uniform partitioning, but the variation in computation times is still significant because the computation also depends on the distribution of zero crossings. The weighted static scheme performs better than static, and further reduces the variation in computation times. Note that these schemes only measure the load approximately, and therefore will not divide the computation exactly uniformly. Furthermore, the minimum granularity is a row boundary in order to avoid communication between processors. Finally, for the 8 processor case, the dynamic scheme performs the best. Table 2 summarizes the distribution for the 8 processor case. The table shows the computation time, variation ratio, and improvement ratio for each processor under all four methods. For example, the variation ratio is 44.25 for uniform partitioning, 2.71 for static load balancing, 1.50 for weighted static, and 1.09 for dynamic load balancing. The improvement ratio is the ratio of the speedup obtained with load balancing to that of uniform partitioning. The computation times shown include all the overhead of the load balancing schemes. Figure 6 shows the speedup graph for multiprocessor sizes varying from 1 processor to 16. We observe that uniform partitioning does not provide any significant gains in speedup as the number of processors increases. The dynamic scheme performs the best among all the schemes, and the two static schemes perform comparably with the dynamic scheme. We believe that as the number of processors is increased, the two static schemes will move even closer to the dynamic scheme, or even perform better than the dynamic scheme, because for a larger multiprocessor the overhead of the dynamic scheme will be greater.

Figure 5 : Distribution of Computation Times for Stereo Match (P=8)
(X-axis: Processor #)

4.4. Time Match 

The computation in the time match algorithm is similar to that in stereo match, except that the search space is two-dimensional and the input to the algorithm is the stereo match output. The other difference is that the number of significant points in the input data is much smaller than that in stereo match, because a great many input points are eliminated in stereo match. Table 3 shows the distribution of the computation times for the 16 processor case. We only present the uniform partitioning and static load balancing cases. The most important observation is that uniform partitioning performs worse than in the case of stereo match, and static load balancing performs better.

Table 2 : Distribution of Computation Times for Stereo Match
(Computation Time Distribution for Stereo Match, P=8)

Figure 6 : Speedups for Stereo Match Computation

Table 3 also shows how the measure of computation (the number of zero crossings left from the stereo match step) is
divided among the processors in the two cases. It is clear that the zero crossings are very evenly distributed
(within the constraint of a minimum granule of one row) in the static case, whereas they are concentrated on a few
processors in the uniform partitioning case. Figure 7 shows the speedup graphs for the two schemes for a range of
multiprocessor sizes. The speedup gains for the load balanced case are very significant over the uniform partitioning
case. We measured the overhead of performing knowledge-based static load balancing to be 3 ms, which is negligible
compared to the computation time, while the performance gains are significant.
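
The static decomposition described above can be illustrated with a minimal sketch. It assumes that the load measure
is the count of surviving zero crossings in each image row, that each processor receives a contiguous block of rows,
and that a row is the smallest unit that can be assigned. The function names and the sample row counts are
hypothetical and are not taken from the implementation evaluated in this paper.

```python
from itertools import accumulate


def uniform_partition(num_rows, num_procs):
    """Split rows into contiguous blocks with (nearly) equal row counts."""
    bounds = [k * num_rows // num_procs for k in range(num_procs + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def balanced_partition(row_loads, num_procs):
    """Split rows into contiguous blocks with roughly equal total load.

    row_loads[i] is the assumed load measure for row i (for example, the
    number of zero crossings left in that row by the previous step).
    """
    total = sum(row_loads)
    prefix = [0] + list(accumulate(row_loads))  # prefix[r] = load of rows [0, r)
    bounds = [0]
    row = 0
    for k in range(1, num_procs):
        target = k * total / num_procs  # ideal cumulative load at the k-th cut
        # Advance while the cumulative load stays at or below the target.
        while row < len(row_loads) and prefix[row + 1] <= target:
            row += 1
        # Take one more row if that lands closer to the target (row granule).
        if row < len(row_loads) and prefix[row + 1] - target < target - prefix[row]:
            row += 1
        bounds.append(row)
    bounds.append(len(row_loads))
    return list(zip(bounds[:-1], bounds[1:]))


def block_loads(row_loads, blocks):
    """Total load assigned to each processor for a list of (start, end) blocks."""
    return [sum(row_loads[start:end]) for start, end in blocks]


if __name__ == "__main__":
    # Hypothetical data: all features are lumped in the lower half of the image.
    zero_crossings_per_row = [0, 0, 0, 0, 5, 5, 5, 5]
    schemes = {
        "uniform": uniform_partition(len(zero_crossings_per_row), 2),
        "static": balanced_partition(zero_crossings_per_row, 2),
    }
    for name, blocks in schemes.items():
        print(name, blocks, block_loads(zero_crossings_per_row, blocks))
```

With the sample counts, uniform partitioning leaves one of the two processors with no zero crossings, while the
load-based split gives both processors the same number. The cut positions come from a single pass over the prefix
sums, so the cost of computing them is small compared with the matching computation itself.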

4.5. Second Stereo Match 

This step involves stereo match computation for features from the images at time instant t_{i+1}, after time point
correspondence is established between the images at times t_i and t_{i+1}. The matching is similar to that in the first
stereo match, except that it needs to be done only at those points at which time correspondence has already been
established. Consequently, the number of features to be matched is much smaller than that in the first computation, and
hence the importance of load balancing is further increased. Figure 8 depicts the distribution of computation times
for the second stereo match step. The three load balancing algorithms used in this case are Uniform Partitioning,
Static and Dynamic. We observe from the figure that uniform partitioning does not perform well compared to the other
two schemes. The variation in computation time is significant, and the static and dynamic schemes perform comparably.

Figure 7 : Speedup for Time Match

Figure 9 presents the speedups for the same algorithm for various multiprocessor sizes. The figure shows that
the gains from these load balancing schemes are very significant over uniform partitioning. One important observation
can be made by comparing the results in Figures 6 and 9: the performance of uniform partitioning in the second stereo
match is much worse than that in the first stereo match. For example, for the 16 processor case, the speedup in the
first stereo match is 5.55, whereas for the same multiprocessor size the speedup is only approximately 2.3 for the
second stereo match. Therefore, as the computation progresses in an integrated environment, the gains of these load
balancing schemes become increasingly significant.
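
One way to read the two speedups quoted above is through parallel efficiency, that is, the speedup divided by the
number of processors. This metric is a standard one but is not reported in the paper. The sketch below only reuses
the two uniform partitioning speedups from the text (the second is approximate), and the function name is an
assumption.

```python
def efficiency(speedup, num_procs):
    """Parallel efficiency: the fraction of ideal linear speedup achieved."""
    return speedup / num_procs


NUM_PROCS = 16
# Uniform partitioning speedups quoted in the text for the 16 processor case.
first_stereo = efficiency(5.55, NUM_PROCS)
second_stereo = efficiency(2.3, NUM_PROCS)  # "approximately 2.3" in the text

print(f"first stereo match:  {first_stereo:.1%} of ideal")
print(f"second stereo match: {second_stereo:.1%} of ideal")
```

Under this reading, uniform partitioning uses roughly a third of the machine in the first stereo match and well
under a fifth in the second, which is the degradation the comparison of Figures 6 and 9 points to.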
Figure 8 : Distribution of Computation Times for Second Stereo Match (P=8)

4.6. Summary of Results

In summary, the following important observations can be made from the results presented in this section.
First, the improvement in performance (such as utilization and speedup) from using the load balancing techniques
itself increases as the number of processors increases. Therefore, performance gains are expected to be higher for
larger multiprocessors. Second, in an integrated environment, the overheads of such methods are small because the
measure of load can be computed at run time as a by-product of the current task. Finally, though we showed the
performance results of the implementation on the hypercube multiprocessor, these methods can be applied when
algorithms are mapped onto any medium to large grain multiprocessor system, because these techniques are independent
of the underlying multiprocessor architecture.

Consider the overall performance gains for the entire system. As the computation progresses from one step to
the next, uniform partitioning performs worse because the number of data points decreases while the computation at
each point increases. Hence, with uniform partitioning, the gains of using parallel processing are minimal. However,
the load balancing techniques recognize the data distribution at each step, and the data is decomposed using that
distribution. Therefore, performance gains are expected to improve as the computation progresses in an integrated
environment. For example, consider the zero crossing, stereo match, time match, and second stereo match steps. In the
zero crossing computation, uniform partitioning performs well and the load is balanced; hence, the improvement ratio
is 1. For stereo match, the improvement of static over uniform partitioning is 2.15 for the 8 processor case and 2.22
for the 16 processor case. Similarly, for the time match step, the improvement of static load balancing is 3.38 for
the 8 processor case and 4.2 for the 16 processor case. Therefore, the improvement in performance itself increases
both as the number of processors increases and as the computation progresses from one step to the next in a vision
system.
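
The trend in the improvement ratios quoted above can be checked mechanically. The sketch below assumes that the
improvement ratio is the uniform partitioning time divided by the static load balancing time on the same number of
processors, which the text does not state explicitly. The table holds only the values reported in this section, and
the ratio of 1 for the zero crossing step is applied to both processor counts as an assumption. All names are
hypothetical.

```python
def improvement_ratio(time_uniform, time_static):
    """Assumed definition: how many times faster static is than uniform."""
    return time_uniform / time_static


# Improvement of static load balancing over uniform partitioning, as quoted
# in the text. Keys of the inner dictionaries are processor counts.
STEPS = ["zero crossing", "stereo match", "time match"]
REPORTED = {
    "zero crossing": {8: 1.0, 16: 1.0},  # text gives 1 without a processor count
    "stereo match": {8: 2.15, 16: 2.22},
    "time match": {8: 3.38, 16: 4.2},
}


def increases_along_steps(table, steps, num_procs):
    """True if the ratio grows strictly from each step to the next."""
    ratios = [table[step][num_procs] for step in steps]
    return all(a < b for a, b in zip(ratios, ratios[1:]))


def non_decreasing_with_processors(table, small, large):
    """True if no step has a lower ratio on the larger machine."""
    return all(row[small] <= row[large] for row in table.values())


if __name__ == "__main__":
    for procs in (8, 16):
        print(procs, "processors:", increases_along_steps(REPORTED, STEPS, procs))
    print("larger machine:", non_decreasing_with_processors(REPORTED, 8, 16))
```

Both checks hold for the reported values, which matches the two conclusions drawn in the paragraph above.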

5. Concluding Remarks 

In this paper we presented techniques to perform efficient data decomposition and load balancing for vision
systems with medium to large grain parallelism. Two important characteristics of these techniques are that they are
general enough to apply to any such integrated system, and that they use statistics and knowledge from the execution
of a task to perform data decomposition and load balancing for the next task in the system. The advantages of such
schemes are as follows. First, these techniques use characteristics of the tasks and the data, and therefore work
well no matter how the data changes. Second, many vision systems consist of such tasks and exhibit the computation
flow described above, and therefore these techniques can be used in any such system.

Finally, the performance of the proposed techniques was evaluated using a parallel implementation of the
motion estimation system algorithms on a hypercube multiprocessor system. The results show that with uniform
partitioning, which does not consider the computations involved, parallel processing does not provide significant
performance improvements over sequential processing. Furthermore, by applying the proposed data decomposition and
load balancing techniques, significant performance gains (as much as 6 fold) can be obtained over uniform
partitioning.



Alok Choudhary 


REFERENCES 


[1] C. Weems, A. Hanson, E. Riseman, and A. Rosenfeld, “An integrated image understanding benchmark:
recognition of a 2 1/2 D mobile,” in International Conference on Computer Vision and Pattern Recognition,
Ann Arbor, MI, June 1988.

[2] Alok N. Choudhary, “Parallel architectures and parallel algorithms for integrated vision systems,” Ph.D.
Thesis, University of Illinois, Urbana-Champaign, August 1989.

[3] Mun K. Leung and Thomas S. Huang, “Point matching in a time sequence of stereo image pairs,” Tech.
Rep., CSL, University of Illinois, Urbana-Champaign, 1987.

[4] M. K. Leung, A. N. Choudhary, J. H. Patel, and T. S. Huang, “Point matching in a time sequence of stereo
image pairs and its parallel implementation on a multiprocessor,” in IEEE Workshop on Visual Motion,
Irvine, CA, March 1989.

[5] Alok N. Choudhary, Subhodev Das, Narendra Ahuja, and Janak H. Patel, “Surface reconstruction from
stereo images: an implementation on a hypercube multiprocessor,” in The Fourth Conference on
Hypercubes, Concurrent Computers, and Applications, Monterey, CA, March 1989.

[6] K. S. Arun, T. S. Huang, and S. D. Blostein, “Least-squares fitting of two 3-D point sets,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 9, pp. 698-700, September 1987.

[7] A. Huertas and G. Medioni, “Detection of intensity changes with subpixel accuracy using Laplacian-
Gaussian masks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-8, pp.
651-664, September 1986.

[8] Y. C. Kim and J. K. Aggarwal, “Positioning 3-D objects using stereo images,” Computer and Vision
Research Center, The University of Texas at Austin.






